1 INTRODUCTION



In standard clinical trial designs, the sample size is determined by the power at a given
alternative (e.g., treatment effect). In practice, especially for new treatments about which
there is little information on the magnitude and sampling variability of the treatment effect,
it is often difficult for investigators to specify a realistic alternative on which sample size
determination can be based. Therefore, the problem of sample size re-estimation based on an
observed treatment difference at some time before the prescheduled end of the trial has
attracted considerable attention during the past decade; see, e.g., Jennison and Turnbull (2000,
Section 14.2), Shih (2001) and Whitehead et al. (2001). Moreover, there are concerns from
the regulatory perspective regarding possible inflation of the type I error probability when
such sample size adjustments are used in pharmaceutical trials. For normally distributed
outcome variables, Proschan and Hunsberger (1995), Fisher (1998), Posch and Bauer (1999),
and Shen and Fisher (1999) have proposed ways to adjust the test statistics after mid-course
sample size modification so that the type I error probability is maintained at the prescribed
level. Jennison and Turnbull (2003) gave a general form of these methods and showed that
they performed considerably worse than group sequential tests. Tsiatis and Mehta (2003)
independently came to the same conclusion, attributing the inefficiency to the fact that the
adjusted test statistics are not sufficient statistics. It is possible to adhere to efficient
generalized likelihood ratio statistics in a mid-course adaptive design if one uses the non-normal
sampling distribution (due to the mid-course adaptation) of the test statistic, instead of
ignoring the non-normality and thereby inflating the type I error. A way to do this was proposed
by Li et al. (2002), but it was shown by Turnbull (2006) to be relatively inefficient compared
to group sequential tests. Jennison and Turnbull (2006a) recently introduced adaptive group
sequential tests that choose the $j$th group size and stopping boundary on the basis of the
cumulative sample size $n_{j-1}$ and the sample sum $S_{n_{j-1}}$ over the first $j-1$ groups,
and that are optimal in the sense of minimizing a weighted average of the expected sample sizes
over a collection of parameter values subject to prescribed error probabilities at the null and a
given alternative hypothesis. They also showed how the corresponding optimization problem
can be solved numerically by using the backward induction algorithms for "optimal sequentially
planned" designs developed by Schmitz (1993). Jennison and Turnbull (2006c) found
that standard (non-adaptive) group sequential tests with the first stage chosen optimally
are nearly as efficient as their optimal adaptive counterparts, which are considerably more
complicated, and we use these as a benchmark for our comparisons in Section 3.

With the goal of achieving similar efficiency in more complicated situations where the
alternative of interest and/or nuisance parameters are not known, we give in Section 2 a
simple adaptive test which updates the sample size after the initial stage by using estimates
of the unknown parameters and adjustments for the uncertainty of these estimates. This
is done first for the one-parameter case in Section 2.1 and extended to the multiparameter
setting in Section 2.3. These tests usually terminate at the first or second stage, but allow
the possibility of a third stage to account for uncertainties in the second-stage sample size
estimate. The tests control the type I error probability and have power close to that of the
uniformly most powerful fixed sample test. Section 3 gives a comprehensive simulation study,
which is the first of its kind, of the adaptive tests in the aforementioned references and compares
them with the adaptive tests developed in Section 2 and with fixed sample size and standard
group sequential tests having the same minimum and maximum sample sizes. A thorough
evaluation of the performance of these tests is presented, involving the power, mean number
of stages, and the mean, 25th, 50th, and 75th percentiles of the sample size distribution under
a wide range of alternatives, subject to the prescribed constraints on type I error probability
and first-stage and maximum sample sizes. Section 3.1 also compares the proposed adaptive
test with the benchmark optimal adaptive test of Jennison and Turnbull (2006a, 2006c), and
the variance unknown case is considered in Section 3.2. An example from the National Heart,
Lung and Blood Institute Coronary Intervention Study is given in Section 3.3. Section 4
gives some concluding remarks.



2 EFFICIENT ADAPTIVE TESTS WITH THREE 
OR FEWER STAGES 

In this section we consider one-sided tests of the null hypothesis $H_0: \theta \le \theta_0$ on the natural
parameter $\theta$ in a one-parameter exponential family $f_\theta(x) = e^{\theta x - \psi(\theta)}$ of densities with respect
to some measure on the real line. Let $X_1, X_2, \ldots$ denote the successive observations, and let
$S_n = X_1 + \cdots + X_n$. A sufficient statistic based on $(X_1, \ldots, X_n)$ is $\bar X_n = S_n/n$, and the
maximum likelihood estimate of $\theta$ is $\hat\theta_n = (\psi')^{-1}(\bar X_n)$. The special case of normal $X_i$ with
mean $\theta$ and known variance 1 is widely used in the literature on sample size re-estimation as
a prototype which can be used to approximate more complicated situations via the central
limit theorem, as in the references in Section 1.

In practice, there is an upper bound $M$ on the allowable sample size for a clinical
trial because of funding and duration constraints and because there are other trials that
compete for patients, investigators and resources. The re-estimated sample size in two-stage
designs has to be restricted within this bound; see, e.g., Li et al. (2002, p. 283). Lai and
Shih (2004, p. 511) have pointed out that $M$ implies constraints on the alternatives that
can be considered in power calculations to determine the sample size. Specifically, by the
Neyman-Pearson lemma, the fixed sample size (FSS) test that rejects $H_0$ if $S_M \ge c_{\alpha,M}$ has
maximal power at any alternative $\theta > \theta_0$, and in particular at the alternative $\theta_1$ at which the
FSS test has prescribed power $1 - \tilde\alpha$. Here $c_{\alpha,n}$ denotes the critical value of the level-$\alpha$ FSS
test based on a sample of size $n$, i.e., $\mathrm{pr}_{\theta_0}\{S_n \ge c_{\alpha,n}\} = \alpha$. Typical sample size re-estimation
procedures in the literature (see, e.g., the references in Section 1) first use the initial sample
of size $m$, which is some fraction of $M$, to provide an estimate $\hat\theta_m$ of $\theta$ and then evaluate
the sample size of the FSS test that has conditional power $1 - \tilde\alpha$ given the alternative $\hat\theta_m$,
assuming that $\hat\theta_m > \theta_0$. This results in a two-stage procedure, which does not incorporate the
sampling variability of the estimate $\hat\theta_m$. A simple way to make "uncertainty adjustments"
in the above procedure, so that it attempts to "self-tune" itself to the actual $\theta$ value, is to allow
the possibility of not stopping at the second stage when $H_0$ is not rejected, by including a
third (and final) stage with total sample size $M$.
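The link between $M$ and the implied alternative $\theta_1$ can be illustrated for the normal prototype with unit variance. The following is a minimal sketch under that assumption; the function names are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import NormalDist

_z = NormalDist().inv_cdf  # standard normal quantile function


def implied_alternative(M, alpha, alpha_tilde, theta0=0.0):
    """Alternative at which the level-alpha FSS test based on M observations
    from N(theta, 1) has power 1 - alpha_tilde (normal, unit-variance assumption)."""
    return theta0 + (_z(1 - alpha) + _z(1 - alpha_tilde)) / sqrt(M)


def fss_sample_size(theta1, alpha, alpha_tilde, theta0=0.0):
    """Real-valued (unrounded) sample size of the level-alpha FSS test with
    power 1 - alpha_tilde at theta1, again for N(theta, 1) data."""
    return ((_z(1 - alpha) + _z(1 - alpha_tilde)) / (theta1 - theta0)) ** 2


theta1 = implied_alternative(120, 0.025, 0.10)   # roughly 0.296
n_fss = fss_sample_size(0.3, 0.025, 0.10)        # roughly 117
```

With $M = 120$, $\alpha = .025$ and $\tilde\alpha = .1$ the implied alternative is close to the value $\theta_1 = .3$ used in the examples of Section 3.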

2.1 An Efficient Test of $H_0$ with At Most Three Stages

To test $H_0: \theta \le \theta_0$ at significance level $\alpha$, suppose no fewer than $m$ but no more than $M$
observations are to be taken. Let $\theta_1$ be the alternative "implied" by $M$, in the sense that
$M$ can be determined as the sample size of the level-$\alpha$ Neyman-Pearson test with power
$1 - \tilde\alpha$ at $\theta_1$. Alternatively, $\theta_1$ can be specified separately from $M$ as a clinically relevant
or realistic anticipated effect size based on prior experimental, observational, or theoretical
evidence, if such information is available. A fundamental result in sequential testing theory
is that Wald's sequential probability ratio test (SPRT) of the simple hypotheses $\theta = \theta'$ vs.
$\theta = \theta''$ has the smallest expected sample size at $\theta = \theta'$ and $\theta''$ among all tests with the same
or smaller type I and II error probabilities; see Chernoff (1972). Moreover, letting $\alpha$ and $\tilde\alpha$
denote the type I and II error probabilities and $T(\theta', \theta'')$ be the sample size of the SPRT,
Chernoff (1972, p. 66) has derived the approximations

$$E_{\theta''}\{T(\theta',\theta'')\} \approx |\log\alpha| / I(\theta'',\theta'), \qquad E_{\theta'}\{T(\theta',\theta'')\} \approx |\log\tilde\alpha| / I(\theta',\theta''), \tag{2.1}$$

where

$$I(\theta,\lambda) = E_\theta[\log\{f_\theta(X_i)/f_\lambda(X_i)\}] = (\theta-\lambda)\psi'(\theta) - \{\psi(\theta) - \psi(\lambda)\}$$

is the Kullback-Leibler information number. To test the one-sided hypothesis $H_0: \theta \le \theta_0$,
suppose that we use the maximum likelihood estimator $\hat\theta_m$ from the first stage of the study in
place of the alternative $\theta''$ in (2.1) with $\theta' = \theta_0$, in the event $\hat\theta_m > \theta_0$. Then the first relation in
(2.1) suggests that an efficient second-stage sample size would be around $|\log\alpha| / I(\hat\theta_m, \theta_0)$.
On the other hand, if $\hat\theta_m < \theta_1$, then we can consider the possibility of stopping due to
futility by choosing $\theta' = \hat\theta_m$ and $\theta'' = \theta_1$ in the SPRT, so the second relation in (2.1)
suggests $|\log\tilde\alpha| / I(\hat\theta_m, \theta_1)$ as an efficient second-stage sample size. Adjusting for the sampling
variability in $\hat\theta_m$ by inflating by the factor $1 + \rho_m$, we therefore define the second-stage sample
size

$$n_2 = m \vee \{M \wedge \lceil (1+\rho_m)\, n(\hat\theta_m) \rceil\}, \tag{2.2}$$

where $\rho_m > 0$, $\vee$ and $\wedge$ denote maximum and minimum, respectively, $\lceil x \rceil$ denotes the
smallest integer $\ge x$ (and $\lfloor x \rfloor$ denotes the largest integer $\le x$), and

$$n(\theta) = \min\{|\log\alpha| / I(\theta,\theta_0),\ |\log\tilde\alpha| / I(\theta,\theta_1)\}, \tag{2.3}$$

which is an approximation to Hoeffding's (1960) lower bound for the expected sample size
$E_\theta(T)$ of a test that has type I error probability $\alpha$ at $\theta_0$ and type II error probability $\tilde\alpha$ at
$\theta_1$. Note that (2.2) includes the cases $n_2 = m$ and $n_2 = M$ associated with using just one or
two stages. Moreover, the stopping rule defined below by (2.4)-(2.6) allows the possibility of
stopping after the first or second stage. Therefore, the actual number of stages used by the
"three-stage" test is in fact a random variable taking the values 1, 2, 3.



The three-stage test uses rejection and futility boundaries similar to those of the efficient
group sequential tests introduced by Lai and Shih (2004). Letting $n_i$ denote the total sample
size at the $i$th stage, the test stops at stage $i \le 2$ and rejects $H_0$ if

$$n_i < M, \quad \hat\theta_{n_i} > \theta_0, \quad \text{and} \quad n_i I(\hat\theta_{n_i}, \theta_0) \ge b, \tag{2.4}$$

where $n_1 = m$ and $n_2$ is given by (2.2). The test stops at stage $i \le 2$ and accepts $H_0$ if

$$n_i < M, \quad \hat\theta_{n_i} < \theta_1, \quad \text{and} \quad n_i I(\hat\theta_{n_i}, \theta_1) \ge \tilde b. \tag{2.5}$$

It rejects $H_0$ at stage $i = 2$ or 3 if

$$n_i = M, \quad \hat\theta_M > \theta_0, \quad \text{and} \quad M I(\hat\theta_M, \theta_0) \ge c, \tag{2.6}$$

accepting $H_0$ otherwise. Letting $0 < \varepsilon, \tilde\varepsilon < 1$, define the thresholds $b$, $\tilde b$, and $c$ by the
equations

$$\mathrm{pr}_{\theta_1}\{\text{(2.5) occurs for } i = 1 \text{ or } 2\} = \tilde\varepsilon\tilde\alpha, \tag{2.7}$$

$$\mathrm{pr}_{\theta_0}\{\text{(2.5) does not occur for } i \le 2, \text{ and (2.4) occurs for } i = 1 \text{ or } 2\} = \varepsilon\alpha, \tag{2.8}$$

$$\mathrm{pr}_{\theta_0}\{\text{(2.4) and (2.5) do not occur for } i \le 2, \text{ and (2.6) occurs}\} = (1-\varepsilon)\alpha. \tag{2.9}$$

Note that (2.8) and (2.9) imply that the type I error probability is exactly $\alpha$, and we have
found in our simulations (see Section 3) that the power at $\theta_1$ is generally close to, but slightly
less than, $1 - \tilde\alpha$. The values $\varepsilon, \tilde\varepsilon$ are the fractions of type I and II error probabilities "spent"
at the first two stages, and in theory any values $0 < \varepsilon, \tilde\varepsilon < 1$ may be used. In practice, we
recommend using $0.2 \le \varepsilon, \tilde\varepsilon \le 0.8$, and we have found that the power and expected sample
size of the above adaptive test vary very little with changes in $\varepsilon, \tilde\varepsilon$. In particular, the three
examples in Section 3 use $\varepsilon = \tilde\varepsilon = 1/3$, $(\varepsilon, \tilde\varepsilon) = (1/2, 3/4)$, and $\varepsilon = \tilde\varepsilon = 1/2$. The factor
$1 + \rho_m$ in (2.2) is a small inflation of $n(\hat\theta_m)$ to adjust for the uncertainty in $\hat\theta_m$. Lorden (1983) gives
an asymptotic upper bound for $\rho_m$ as a function of $\theta_0$, $\theta_1$, $\alpha$, and $\tilde\alpha$. We advocate simply
fixing $\rho_m$ to a small maximum inflation that the practitioner is comfortable with, and have
found that $\rho_m = .05$ or $.1$ works well in practice; these are the values used in the examples in Section 3.
As with $M$, the choice of $m$ is often determined by practical considerations like funding and
duration. To aid such considerations or in the absence of them, if the practitioner has bounds
$\underline\theta < \theta_0$ and $\bar\theta > \theta_1$ in mind (e.g., $\bar\theta$ might be the largest realistic treatment effect likely to
be seen), then $m$ could be chosen to be $n(\underline\theta) \wedge n(\bar\theta)$, an approximation to Hoeffding's (1960)
lower bound for the smallest expected sample size of a test with error probabilities $\alpha, \tilde\alpha$ at
$\theta_0, \theta_1$ when $\theta = \underline\theta$ or $\bar\theta$.
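The sample size rule (2.2)-(2.3) is easy to evaluate once the Kullback-Leibler information is available. The sketch below does so for the normal prototype with unit variance, where $I(\theta,\lambda) = (\theta-\lambda)^2/2$; the function names and the default `rho = 0.1` are illustrative assumptions rather than part of the paper.

```python
from math import ceil, inf, log


def kl_normal(theta, lam):
    """Kullback-Leibler information I(theta, lam) for N(theta, 1) observations."""
    return 0.5 * (theta - lam) ** 2


def n_theta(theta, alpha, alpha_tilde, theta0, theta1, kl=kl_normal):
    """n(theta) of (2.3): the smaller of the two SPRT-type approximations."""
    i0, i1 = kl(theta, theta0), kl(theta, theta1)
    n0 = abs(log(alpha)) / i0 if i0 > 0 else inf
    n1 = abs(log(alpha_tilde)) / i1 if i1 > 0 else inf
    return min(n0, n1)


def second_stage_size(theta_hat, m, M, alpha, alpha_tilde, theta0, theta1, rho=0.1):
    """n_2 of (2.2): inflate n(theta_hat) by 1 + rho, round up, clip to [m, M]."""
    n = (1.0 + rho) * n_theta(theta_hat, alpha, alpha_tilde, theta0, theta1)
    if n >= M:            # also covers n = inf, where ceil() would fail
        return M
    return max(m, ceil(n))


# Setting of the Section 3.1 example: m = 40, M = 120, alpha = .025, alpha_tilde = .1
args = dict(m=40, M=120, alpha=0.025, alpha_tilde=0.1, theta0=0.0, theta1=0.3)
sizes = {th: second_stage_size(th, **args) for th in (0.0, 0.15, 0.3, 1.0)}
```

In this setting the rule returns $M$ when the first-stage estimate falls midway between $\theta_0$ and $\theta_1$, where the data are least informative, and returns $m$ when the estimate is far from both.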

The probabilities in (2.7)-(2.9) can be computed by Monte Carlo or recursive numerical
integration, using normal approximations to signed-root likelihood ratio statistics. Further
details are given in Section 2.2. The original idea to use (2.2) as the second-stage sample
size and to allow the possibility of a third stage to account for uncertainty in the estimate
$\hat\theta_m$ (and hence $n_2$) is due to Lorden (1983), although his test uses very conservative upper
bounds on the error probabilities. Here we have modified Lorden's test to control the type
I error $\alpha$ exactly, and provided algorithms to implement the modified test. It can be shown
that our three-stage test is asymptotically optimal: if $N$ is the sample size of our three-stage
test above, then

$$E_\theta(N) \sim m \vee \left\{ M \wedge \frac{|\log\alpha|}{I(\theta,\theta_0) \vee I(\theta,\theta_1)} \right\} \tag{2.10}$$

as $\alpha + \tilde\alpha \to 0$, $\log\alpha \sim \log\tilde\alpha$, $\rho_m \to 0$ and $\rho_m\sqrt{m/\log m} \to \infty$; and if $T$ is the sample size
of any test of $H_0: \theta \le \theta_0$ whose error probabilities at $\theta_0$ and $\theta_1$ do not exceed $\alpha$ and $\tilde\alpha$,
respectively, then

$$E_\theta(T) \ge (1 + o(1))\, E_\theta(N) \tag{2.11}$$

simultaneously for all $\theta$. The proof uses Hoeffding's (1960) lower bound for $E_\theta(T)$ as in
Lorden (1983).

Since $\log\alpha \sim \log(\varepsilon\alpha)$ as $\alpha \to 0$ for any fixed $0 < \varepsilon < 1$, the asymptotic formula for
$E_\theta(N)$ in (2.10) is unchanged if one replaces the type I error probability $\alpha$ by a fraction of
it, and this is why Lorden (1983) can use crude bounds of the type above for the type I error
probability. For values of the type I error probability $\alpha$ (e.g., .05 or .01) commonly used in
practice, however, replacing $\alpha$ by $\alpha/10$, say, can substantially increase $E_\theta(N)$. Note that our adaptive
test keeps the error probability at $\theta_0$ equal to $\alpha$ (instead of less than $\alpha$) by using Monte Carlo
or recursive numerical integration to evaluate it, as discussed in the next section.
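The cost of a conservative bound on $\alpha$ can be gauged from the leading term $|\log\alpha|/I$ in (2.10). The snippet below is only a rough first-order indicator, since it ignores truncation at $m$ and $M$; the helper name is hypothetical.

```python
from math import log


def log_inflation(alpha, factor=10.0):
    """Ratio |log(alpha / factor)| / |log(alpha)|, i.e. the first-order inflation
    of a sample size approximation of the form |log alpha| / I when the type I
    error alpha is replaced by alpha / factor."""
    return log(alpha / factor) / log(alpha)


inflation_05 = log_inflation(0.05)   # about 1.77
inflation_01 = log_inflation(0.01)   # about 1.5
```

So at $\alpha = .05$ or $.01$, a tenfold conservative bound inflates the leading term by roughly 50% to 75%, even though the effect vanishes asymptotically.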



2.2 The Normal Case and Recursive Numerical Integration 

The thresholds $\tilde b$, $b$, and $c$ can be computed by solving in succession (2.7), (2.8), and (2.9).
Univariate grid search or Brent's method (1992) can be used to solve each equation. Suppose
the $X_i$ are $N(\theta, 1)$. Without loss of generality, we shall assume that $\theta_0 = 0$. Since $I(\theta,\lambda) =
(\theta - \lambda)^2/2$, we can rewrite (2.7) as

$$\mathrm{pr}_{\theta_1}\{S_m - m\theta_1 \le -(2\tilde b m)^{1/2}\} + \mathrm{pr}_{\theta_1}\{S_m - m\theta_1 > -(2\tilde b m)^{1/2},\ S_{n_2} - n_2\theta_1 \le -(2\tilde b n_2)^{1/2}\} = \tilde\varepsilon\tilde\alpha \tag{2.12}$$

and (2.8) and (2.9) as

$$\mathrm{pr}_0\{S_m/(2m)^{1/2} \ge b^{1/2}\} + \mathrm{pr}_0\{\theta_1 (m/2)^{1/2} - \tilde b^{1/2} < S_m/(2m)^{1/2} < b^{1/2},\ S_{n_2}/(2n_2)^{1/2} \ge b^{1/2}\} = \varepsilon\alpha, \tag{2.13}$$

$$\mathrm{pr}_0\{\theta_1 (m/2)^{1/2} - \tilde b^{1/2} < S_m/(2m)^{1/2} < b^{1/2},\ \theta_1 (n_2/2)^{1/2} - \tilde b^{1/2} < S_{n_2}/(2n_2)^{1/2} < b^{1/2},\ S_M/(2M)^{1/2} \ge c^{1/2}\} = (1-\varepsilon)\alpha. \tag{2.14}$$

As in (2.4) and (2.5), the conditions on $S_{n_2}$ in (2.12)-(2.14) apply only when $n_2 < M$.
The probabilities involving $n_2$ can be computed by conditioning on the value $x$ of $S_m/m$, which
completely determines the value of $n_2$, denoted by $k(x)$. For example, the probabilities under
$\theta = 0$ can be computed via

$$\mathrm{pr}_0\{S_{n_2} \ge (2bn_2)^{1/2} \mid S_m = mx\} = \Phi\!\left(\frac{mx - [2bk(x)]^{1/2}}{[k(x) - m]^{1/2}}\right), \tag{2.15}$$

$$\mathrm{pr}_0\{S_{n_2} \in dy,\ S_M \in dz \mid S_m = mx\} = \varphi_{k(x)-m}(y - mx)\,\varphi_{M-k(x)}(z - y)\,dy\,dz, \tag{2.16}$$

where $\Phi$ is the standard normal c.d.f. and $\varphi_v$ is the $N(0, v)$ density function, i.e., $\varphi_v(w) =
(2\pi v)^{-1/2}\exp(-w^2/(2v))$. The probabilities under $\theta_1$ can be computed similarly. Hence standard
recursive numerical integration algorithms can be used to compute the probabilities in
(2.12)-(2.14).

As an example, we compute the thresholds $b$, $\tilde b$, and $c$ for the following adaptive test
whose performance is studied in Section 3.1. Here $M = 120$, $\alpha = .025$, and we want the
power to be close to $1 - \tilde\alpha = .9$ at $\theta = \theta_1 = .3$. Setting $\varepsilon = \tilde\varepsilon = 1/3$ and $\rho_m = .1$, we first
find $\tilde b$ by solving (2.12), which can be written as an integral over the first-stage sample mean
by the analog of (2.15) for $\theta = \theta_1$, where $\varphi_v$ is as in (2.16). The integral is computed by
numerical integration and a few iterations of the bisection method give $\tilde b = 1.99$. This value
is next used to find $b$ similarly by solving (2.13), which can be written as

$$1 - \Phi([2b]^{1/2}) + \int_{\theta_1 - [2\tilde b/m]^{1/2}}^{[2b/m]^{1/2}} \Phi\!\left(\frac{mx - [2bk(x)]^{1/2}}{[k(x) - m]^{1/2}}\right)\varphi_m(mx)\,m\,dx = \varepsilon\alpha = \frac{0.025}{3}$$

by (2.15). The bisection method gives $b = 3.26$, which we in turn use to find $c$ by solving
(2.14), which is

$$\int_{\theta_1 - [2\tilde b/m]^{1/2}}^{[2b/m]^{1/2}} \int_{k(x)\theta_1 - [2\tilde b k(x)]^{1/2}}^{[2bk(x)]^{1/2}} \left[1 - \Phi\!\left(\frac{[2cM]^{1/2} - y}{[M - k(x)]^{1/2}}\right)\right]\varphi_{k(x)-m}(y - mx)\,\varphi_m(mx)\,m\,dy\,dx = (1-\varepsilon)\alpha = \frac{0.05}{3}$$

by (2.16), giving $c = 2.05$.
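The first of these computations can be sketched in a few lines. The code below solves (2.7) for $\tilde b$ in the normal case by midpoint quadrature over $x = S_m/m$ and bisection. It is an illustration under stated assumptions, not the authors' implementation: it takes $m = 40$ (the first-stage size used in Section 3.1), counts a second-stage futility stop only when $m < k(x) < M$ as required by (2.5), and uses hypothetical function names.

```python
from math import ceil, erf, exp, inf, log, pi, sqrt


def Phi(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def k_of_x(x, m, M, alpha, alpha_t, theta1, rho):
    """Second-stage total sample size (2.2) for N(theta, 1) data with theta0 = 0."""
    i0, i1 = 0.5 * x * x, 0.5 * (x - theta1) ** 2
    n0 = abs(log(alpha)) / i0 if i0 > 0 else inf
    n1 = abs(log(alpha_t)) / i1 if i1 > 0 else inf
    n = (1.0 + rho) * min(n0, n1)
    return M if n >= M else max(m, ceil(n))


def futility_prob(bt, m, M, alpha, alpha_t, theta1, rho, steps=4000):
    """pr_{theta1}{futility stop (2.5) at stage 1 or 2}, by midpoint quadrature."""
    lo = theta1 - sqrt(2.0 * bt / m)      # stage-1 futility boundary for x
    hi = theta1 + 8.0 / sqrt(m)           # the tail beyond this point is negligible
    total = Phi(-sqrt(2.0 * bt))          # stage-1 futility probability
    h = (hi - lo) / steps
    for j in range(steps):
        x = lo + (j + 0.5) * h
        k = k_of_x(x, m, M, alpha, alpha_t, theta1, rho)
        if k <= m or k >= M:              # no separate stage-2 futility look
            continue
        dens = sqrt(m / (2.0 * pi)) * exp(-0.5 * m * (x - theta1) ** 2)
        cond = Phi((m * (theta1 - x) - sqrt(2.0 * bt * k)) / sqrt(k - m))
        total += cond * dens * h
    return total


def solve_decreasing(f, target, a, b, tol=1e-4):
    """Bisection for f(t) = target when f is decreasing on [a, b]."""
    while b - a > tol:
        mid = 0.5 * (a + b)
        if f(mid) > target:
            a = mid
        else:
            b = mid
    return 0.5 * (a + b)


b_tilde = solve_decreasing(
    lambda t: futility_prob(t, 40, 120, 0.025, 0.1, 0.3, 0.1), 0.1 / 3, 0.5, 5.0)
```

Under these assumptions the root should land near the reported $\tilde b = 1.99$; the thresholds $b$ and $c$ can be obtained in the same way from (2.13) and (2.14), the latter requiring a double integral.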



2.3 Multiparameter Extension 



Suppose $X_1, X_2, \ldots$ are independent $d$-dimensional random vectors from a multiparameter
exponential family $f_\theta(x) = \exp\{\theta^T x - \psi(\theta)\}$ of densities. The three-stage test in Section 2.1
can be readily extended to test $H_0: u(\theta) \le u_0$, where $u$ is any smooth real-valued function.
As in Section 2.1, $n_1 = m$ and $n_3 = M$. The stopping rule of the three-stage test of
$H_0: u(\theta) \le u_0$ is the same as (2.4)-(2.6) but with $n_i I(\hat\theta_{n_i}, \theta_j)$ replaced by

$$\inf_{\theta: u(\theta) = u_j} n_i I(\hat\theta_{n_i}, \theta), \tag{2.17}$$

$j = 0, 1$, where $u_1 > u_0$ is the alternative implied by the maximum sample size $M$ and the
desired type II error probability $\tilde\alpha$; see Lai and Shih (2004, Section 3.4). In particular, the test stops and
rejects $H_0$ at stage $i \le 2$ if

$$n_i < M, \quad u(\hat\theta_{n_i}) > u_0, \quad \text{and} \quad \inf_{\theta: u(\theta) = u_0} n_i I(\hat\theta_{n_i}, \theta) \ge b, \tag{2.18}$$

which is analogous to (2.4). Early stopping for futility (accepting $H_0$) can also occur at stage
$i \le 2$ if

$$n_i < M, \quad u(\hat\theta_{n_i}) < u_1, \quad \text{and} \quad \inf_{\theta: u(\theta) = u_1} n_i I(\hat\theta_{n_i}, \theta) \ge \tilde b, \tag{2.19}$$

which is analogous to (2.5). The test rejects $H_0$ at stage $i = 2$ or 3 if

$$n_i = M, \quad u(\hat\theta_M) > u_0, \quad \text{and} \quad \inf_{\theta: u(\theta) = u_0} M I(\hat\theta_M, \theta) \ge c, \tag{2.20}$$

accepting $H_0$ otherwise. The thresholds $b$, $\tilde b$, and $c$ are chosen to ensure certain type I and
type II error probability constraints that are similar to (2.7)-(2.9) and are computed by using
the normal approximation to the signed-root likelihood ratio statistic

$$\ell_n(\delta) = \{\mathrm{sign}(u(\hat\theta_n) - \delta)\}\Big\{2n \inf_{\theta: u(\theta) = \delta} I(\hat\theta_n, \theta)\Big\}^{1/2}$$

under the hypothesis $u(\theta) = \delta$; see Lai and Shih (2004, p. 513). Note that this normal approximation can
also be used for the choice of $u_1$ implied by the maximum sample size $M$ and the type II error
probability $\tilde\alpha$. The sample size $n_2$ of the three-stage test is given by (2.2) with

$$n(\theta) = \min\Big\{|\log\alpha| \Big/ \inf_{\lambda: u(\lambda) = u_0} I(\theta, \lambda),\ |\log\tilde\alpha| \Big/ \inf_{\lambda: u(\lambda) = u_1} I(\theta, \lambda)\Big\}, \tag{2.21}$$

which is a generalization of (2.3). Examples of the multiparameter case are given in
Section 3.2 for normally distributed data with unknown variance, and in Section 3.3 for two
binomial populations.



3 COMPARISON WITH OTHER TESTS 

3.1 Normal Mean with Known Variance 

We consider the special case of normal $X_i$ with unknown mean $\theta$ and known variance 1, and
compare a variety of adaptive tests of $H_0: \theta \le 0$ in the literature with the tests proposed
in Section 2.1. In this normal setting, $\hat\theta_n = \bar X_n$ and $I(\theta, \lambda) = (\theta - \lambda)^2/2$. It is widely
recognized that the performance of adaptive tests is difficult to evaluate and compare because
it depends heavily on the choice of first-stage and maximum sample sizes, the number of
groups (stages) allowed, and the parameter values at which the tests are evaluated. For this
reason, the tests evaluated here use the same first-stage and maximum sample sizes, except
for a few illustrative examples discussed below. In addition, we report a variety of operating
characteristics for each test (power, mean number of stages, and the 25th, 50th, and 75th
percentiles in addition to the mean of the sample size distribution) over a wide range of
$\theta$ values. A comprehensive evaluation of adaptive and group sequential tests like this has
not appeared previously in the literature. We also include the uniformly most powerful FSS
test with the same maximum sample size and type I error probability $\alpha$, which provides
the appropriate benchmark for the power of any test of $H_0$. Another relevant comparison
made here, especially given their widespread use in clinical trials, is with standard
(non-adaptive) group sequential tests having a similar number of stages as the adaptive test.



To test $H_0: \theta \le 0$, Proschan and Hunsberger (1995) proposed a two-stage test, based on
the conditional power criterion, which uses the usual $z$-statistic but with a data-dependent
critical value to maintain the type I error at a prescribed level $\alpha$. The test allows early
stopping to accept (or reject) the null hypothesis if the test statistic is below a user-specified upper
normal quantile $z_{p^*}$ (or above some level $k$) at the end of the first stage. Choosing a
data-dependent critical value is tantamount to multiplying the $z$-statistic by a data-dependent
factor and using a fixed critical value. Li et al. (2002) proposed to use the $z$-statistic with
a fixed critical value $c$, while still determining the second-stage sample size by conditional
power and maintaining the type I error at $\alpha$. Their test stops after the first stage if the
test statistic falls below $h$ or above $k$. For each $h$ and conditional power level, their test
has a maximum allowable $k$, which they denote by $k_1(h)$. Fisher (1998) proposed a "variance
spending" method for weighting the observations so that the type I error of his test
does not exceed $\alpha$, despite its data-dependent second-stage sample size that is given by the
conditional power criterion. To avoid a very large second-stage sample size if the first-stage
estimate of $\theta$ lies near the null hypothesis, Shen and Fisher (1999) proposed early stopping
due to futility whenever the upper $100(1 - \alpha_0)\%$ confidence bound for $\theta$ falls below some
specified alternative $\theta_1 > 0$.

Table 1 compares these tests, a FSS test, and two standard group sequential tests with
the adaptive test described in Section 2.1. The values of the user-specified parameters of
the tests are summarized in the list below. The user-specified parameters are chosen so
that the tests have the same first-stage sample size $m = 40$ (except for the FSS test), maximum
sample size $M = 120$ (except for SF'; see the last paragraph of this section), type I error not
exceeding $\alpha = .025$, and nominal power (or conditional power level in the case of conditional
power tests) equal to .9.

• ADAPT: The adaptive test described in Section 2.1 that uses $b = 3.26$, $\tilde b = 1.99$, and
$c = 2.05$ corresponding to $\varepsilon = \tilde\varepsilon = 1/3$ in (2.7)-(2.9), and $\rho_m = .1$ (see Section 2.2 for
details).

• FSS120: The FSS test having sample size 120.

• OBF$_{PF}$, OBF$_{SC}$: O'Brien and Fleming's (1979) one-sided group sequential tests having
three groups of size 40. OBF$_{PF}$ uses power family futility stopping ($\Delta = 1$ in Jennison and
Turnbull (2000, Section 4.2)) and OBF$_{SC}$ uses stochastic curtailment futility stopping ($\gamma = .9$ in
Jennison and Turnbull (2000, Section 10.2)). Both OBF$_{PF}$ and OBF$_{SC}$ use reference alternative
$\theta_1 = .3$; see below.

• PH: Proschan and Hunsberger's (1995) test that uses $p^* = .0436$ and $k = 2.05$.

• L: Li et al.'s (2002) test that uses $h = 1.63$ and $k = k_1(h) = 2.83$.

• SF, SF': Two versions of Shen and Fisher's (1999) test; SF uses $\alpha_0 = .425$ and SF'
uses $\alpha_0 = .154$.

The tests are evaluated at the $\theta$ values where FSS120 has power .01, .025, .6, .8, .9, .95,
and at $\theta = .15$, the midpoint of $\theta = 0$ and $\theta = \theta_1 = .3$, the alternative implied by $M = 120$
since FSS120 has power $1 - \tilde\alpha = .9$ there. This is also the alternative used by the OBF
tests for futility stopping. Each entry in Table 1 is computed by Monte Carlo simulation
with 100,000 replications. To compare tests $T, T'$ with type I error probability $\alpha$ but with
different type II error probabilities $\tilde\alpha_T(\theta), \tilde\alpha_{T'}(\theta)$ and expected sample sizes $E_\theta T, E_\theta T'$ at
$\theta > 0$, Jennison and Turnbull (2006a) defined the efficiency ratio of $T$ to $T'$:

$$R_\theta(T, T') = \frac{(z_\alpha + z_{\tilde\alpha_T(\theta)})^2 / E_\theta T}{(z_\alpha + z_{\tilde\alpha_{T'}(\theta)})^2 / E_\theta T'} \times 100, \tag{3.1}$$

noting that $(z_\alpha + z_{\tilde\alpha_T(\theta)})^2/\theta^2$ is the sample size of the FSS test with the same type I error
probability and power as $T$. Table 1 contains $R_\theta(T, N)$ for all tests $T$ and $\theta > 0$, where $N$
is the sample size of ADAPT.
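The efficiency ratio (3.1) can be computed directly from each test's power and expected sample size at a given $\theta$, since the factor $\theta^2$ cancels. The following sketch assumes unit-variance normal data; the function name and the numbers in the usage lines are hypothetical and are not entries of Table 1.

```python
from statistics import NormalDist


def efficiency_ratio(alpha, power_T, mean_n_T, power_Tp, mean_n_Tp):
    """Efficiency ratio (3.1) of T to T' at a fixed theta > 0.

    power_* are the powers 1 - alpha_tilde of the two tests at theta, and
    mean_n_* their expected sample sizes at theta."""
    z = NormalDist().inv_cdf
    za = z(1.0 - alpha)
    # z(power) is the upper alpha_tilde quantile, because power = 1 - alpha_tilde
    eff_T = (za + z(power_T)) ** 2 / mean_n_T
    eff_Tp = (za + z(power_Tp)) ** 2 / mean_n_Tp
    return 100.0 * eff_T / eff_Tp


# Hypothetical inputs: equal power, T needs 60 observations on average, T' needs 80
r_equal_power = efficiency_ratio(0.025, 0.9, 60.0, 0.9, 80.0)
r_self = efficiency_ratio(0.025, 0.9, 60.0, 0.9, 60.0)
```

A value above 100 means that $T$ is the more efficient of the two tests at that $\theta$; with equal power the ratio reduces to $100\,E_\theta T'/E_\theta T$.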



INSERT TABLE 1 ABOUT HERE 



ADAPT has power comparable to FSS120 at all values of $\theta$ while achieving substantial
savings in sample size, as shown by the percentiles and mean of the sample size. The
three-stage OBF tests have power comparable to ADAPT and FSS120, but ADAPT has sample
size savings over the OBF tests, especially for larger $\theta > 0$, as reflected by the efficiency ratio.
The mean number of stages in Table 1 reveals that although ADAPT allows for the
possibility of three stages, most frequently it uses only one or two stages.
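Operating characteristics of this kind can be estimated by Monte Carlo. The sketch below simulates the three-stage rule (2.4)-(2.6) for $N(\theta, 1)$ data with $\theta_0 = 0$, using the ADAPT thresholds listed above. It is an illustration with hypothetical function names, a much smaller number of replications than the 100,000 used for Table 1, and partial sums drawn directly from their normal distributions.

```python
import random
from math import ceil, inf, log, sqrt


def adapt_trial(theta, rng, m=40, M=120, alpha=0.025, alpha_t=0.1,
                theta1=0.3, rho=0.1, b=3.26, bt=1.99, c=2.05):
    """One run of the three-stage test; returns (reject, sample size, stages)."""
    def info(n, s, ref):
        # n * I(theta_hat_n, ref) with theta_hat_n = s / n and I = (difference)^2 / 2
        return 0.5 * n * (s / n - ref) ** 2

    n, s, stage = m, rng.gauss(m * theta, sqrt(m)), 1
    while True:
        th = s / n
        if n == M:                                    # final analysis, rule (2.6)
            return (th > 0 and info(n, s, 0.0) >= c), n, stage
        if th > 0 and info(n, s, 0.0) >= b:           # early rejection, rule (2.4)
            return True, n, stage
        if th < theta1 and info(n, s, theta1) >= bt:  # early futility, rule (2.5)
            return False, n, stage
        if stage == 1:                                # sample size update (2.2)-(2.3)
            i0, i1 = 0.5 * th ** 2, 0.5 * (th - theta1) ** 2
            n0 = abs(log(alpha)) / i0 if i0 > 0 else inf
            n1 = abs(log(alpha_t)) / i1 if i1 > 0 else inf
            nn = (1.0 + rho) * min(n0, n1)
            nxt = M if nn >= M else max(m, ceil(nn))
        else:
            nxt = M                                   # third and final stage
        # increment of the partial sum over the next group of observations
        s += rng.gauss((nxt - n) * theta, sqrt(nxt - n))
        n, stage = nxt, stage + 1


def simulate(theta, reps=20000, seed=1):
    """Monte Carlo estimates of (rejection rate, mean sample size, mean stages)."""
    rng = random.Random(seed)
    rej = n_sum = st_sum = 0
    for _ in range(reps):
        r, n, st = adapt_trial(theta, rng)
        rej += r
        n_sum += n
        st_sum += st
    return rej / reps, n_sum / reps, st_sum / reps


type1, en0, stages0 = simulate(0.0)
power, en1, stages1 = simulate(0.3)
```

With the default thresholds, a run that does not stop at the first stage always has $n_2 > m$, so the degenerate case $n_2 = m$ needs no special handling here; other parameter choices may require it.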

The conditional power tests PH, L, SF, and SF' are underpowered at values of $\theta > 0$ in
Table 1. In particular, PH, L, and SF all have power less than .6 at $\theta_1 = .3$, where ADAPT,
FSS120, and the OBF tests have power around .9. The lack of power of PH, L, and SF shown
by Table 1 is caused by stopping too early for futility. For example, the PH test stops for
futility after the first stage if $S_m/\sqrt{m}$ falls below $z_{p^*} = 1.71$. But $\mathrm{pr}_{\theta_1}\{S_m/\sqrt{m} < 1.71\} = .44$,
well exceeding the nominal type II error of .1. On the other hand, such stringent futility
stopping is necessary to control the sample size of conditional power tests. For example, the
.025-level PH test that stops for futility only when $\hat\theta_m \le 0$ (i.e., with $p^* = .5$) has expected
sample size many times larger than $M$ at all values of $\theta$ in Table 1, yet power less than .9 at $\theta_1$. SF
and SF' provide another example of this behavior. Since these tests stop for futility at the
first stage when $S_m/m < \theta_1 - z_{\alpha_0}/\sqrt{m}$, the choice of $\alpha_0$ determines the maximum sample
size. For maximum sample size $M = 120$, SF uses $\alpha_0 = .425$, a high rate of first-stage
futility stopping which results in small expected sample sizes, low power, and a reduced type
I error of .012, which is $\alpha = .025$ in the absence of futility stopping. In contrast, SF' uses
less stringent futility stopping with $\alpha_0 = .154$ that corresponds to maximum sample size
$5M = 600$, which results in a type I error closer to .025 and better power, though it is still
underpowered and its expected sample size exceeds 120 at $.2 \le \theta \le .26$. The smallest $\alpha_0$
that does not perturb the type I error of .025 of Shen and Fisher's test is $\alpha_0 = .039$, but the
resultant test has expected sample size 1856 at $\theta = 0$ and maximum sample size 52341.
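The first-stage futility probability quoted for PH is a one-line normal calculation. In the sketch below, the value .44 is approximately reproduced when $\theta_1$ is taken to be the exact alternative implied by $M = 120$ (about .296) rather than the rounded value .3; this reading of $\theta_1$ is an assumption on our part, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_nd = NormalDist()


def first_stage_futility_prob(theta, m, cutoff):
    """pr_theta{S_m / sqrt(m) < cutoff} for N(theta, 1) observations."""
    return _nd.cdf(cutoff - sqrt(m) * theta)


# Exact alternative implied by M = 120, alpha = .025, power .9 (assumed reading)
theta1_implied = (_nd.inv_cdf(0.975) + _nd.inv_cdf(0.9)) / sqrt(120)

p_rounded = first_stage_futility_prob(0.3, 40, 1.71)             # about 0.43
p_implied = first_stage_futility_prob(theta1_implied, 40, 1.71)  # about 0.44
```

Either way, the PH test stops for futility at the first stage more than 40% of the time at the alternative where the nominal type II error is .1.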



INSERT TABLE 2 ABOUT HERE 



The efficiency ratios relative to ADAPT in Table 1 are all less than 100 with the exception
of PH, L, and SF at $\theta = .15$, but it is not clear that the efficiency ratio has much meaning
in this case where the power of these tests is so low. For the other cases, it is natural to ask
if much more improvement is possible. A benchmark for answering this question is provided
by the optimal adaptive tests of Jennison and Turnbull (2006a, 2006c) that minimize the
expected sample size averaged over a collection of $\theta$ values, subject to a given type I error
probability and power level at a prespecified alternative $\theta'$. Table 2 contains the expected
sample size of $T_k^*$, the $k$-stage test minimizing

$$[E_0(T) + E_{\theta'}(T) + E_{2\theta'}(T)]/3 \tag{3.2}$$

among all $k$-stage tests with maximum sample size $M = 120$, type I error probability $\alpha = .025$
and power .8 at $\theta'$, the alternative where FSS100 has power .8, from Jennison and Turnbull
(2006c, Table III). To this benchmark we compare ADAPT with the same first group size
$m = 29$ as $T_3^*$, $M = 120$, $\theta_1$ fixed at $\theta'$, and $b = 2.94$, $\tilde b = .7$, and $c = 2.05$ corresponding to
$\varepsilon = 1/2$, $\tilde\varepsilon = 3/4$. Also included in Table 2 is the optimal $k$-stage "$\rho$-family" group sequential
test (denoted by OGS($k$)) with $M = 120$, groups $2, \ldots, k$ of size $(M - m)/(k - 1)$, and with $m$
and $\rho$ chosen to minimize (3.2). Jennison and Turnbull (2006c) concluded that OGS($k$) is a
computationally easier alternative to $T_k^*$, and Table 2 shows that their expected sample sizes are
close at $\theta = 0, \theta', 2\theta'$. Note that ADAPT has expected sample size close to OGS(3) and $T_3^*$ even
though the probability that ADAPT uses only 1 or 2 stages is 96.4%, 83.1%, and 98.4% for
$\theta = 0, \theta'$, and $2\theta'$, respectively, showing that ADAPT very often behaves like a 2-stage test.
ADAPT has substantially smaller expected sample size than $T_2^*$ and OGS(2), however. On
the other hand, $T_4^*$ is more efficient than ADAPT, but this is due in part to its smaller first
group of $m = 24$, afforded by its additional stage. Here we have matched the first group
$m = 29$ of ADAPT to that of $T_3^*$ for the purpose of comparison, but in practice there is
flexibility in its choice of $m$. The $T_k^*$ and OGS($k$) tests, on the other hand, are rigid in their
choice of $m$, which is determined by dynamic programming from the prespecified alternative
$\theta'$, about which there may be some uncertainty before the trial.



Lokhnygina (2004), who considers somewhat different objective functions than (3.2), has
computed and plotted the data-dependent total sample size of the optimal 2-stage design as
a function of the first-stage sample mean $\bar X_m$. Her results show the total sample size to be
a unimodal function of $\bar X_m$, peaking between 0 and $\theta'$. For comparison, Figure 1 plots the
function $n(\theta)$ of (2.3) in the sample size updating rule (2.2) of ADAPT for the setting of Table 2.
A similar shape is exhibited by Figure 2.2 of Lokhnygina (2004) on the total sample size function
(which is $m$ plus the second-stage sample size) of the optimal two-stage test. This is not surprising
because to be optimal, the expected sample size cannot differ much from Hoeffding's (1960)
lower bound, of which $n(\theta)$ is a close approximation. Figure 1 differs dramatically from
the total sample size function of any untruncated two-stage conditional power rule, which
increases to infinity as $\hat\theta_m$ approaches 0. Jennison and Turnbull (2006b, p. 672) have also
pointed this out and suggested that this is a source of inefficiency of two-stage conditional
power tests.

INSERT FIGURE 1 ABOUT HERE 



3.2 Case of Unknown Variance 

The optimal adaptive test $T_k^*$ and the optimal group sequential tests OGS($k$) in Table 2
require the variance of the observations to be known. As pointed out above, in practice
there is often little information about the sampling variability before the trial. Dynamic
programming is difficult to carry out for the optimal adaptive test when the $X_1, X_2, \ldots$ are
i.i.d. $N(\mu, \sigma^2)$ and both $\mu$ and $\sigma$ are unknown, and no analog of $T_k^*$ has been developed in this
setting. However, the optimal group sequential tests OGS($k$) in Table 2 can be modified for
the present setting by applying their error spending functions to the sequential $t$-statistics,
as described in Jennison and Turnbull (2000, Section 11.5); the resulting tests are denoted by
OGS*($k$). In this section we compare ADAPT with OGS*($k$) and other tests for a normal mean
when the variance is unknown. In the notation of Section 2.3, $\theta = (\mu, \sigma)^T$, $u(\theta) = \mu$, $u_0 = 0$,
and the generalized likelihood ratio statistic (2.17) is

$$(n/2)\log\left\{1 + \frac{(\bar X_n - u_j)^2}{\hat\sigma_n^2}\right\},$$

where $\hat\sigma_n$ is the MLE of $\sigma$. Denne and Jennison (2000) proposed an adaptive group-sequential
extension of Stein's (1945) 2-stage $t$-test in which the total sample size and stopping boundaries
are updated at each stage as a function of the current estimate of $\sigma$. Lai and Shih (2004)
introduced tests of composite hypotheses for a multiparameter exponential family which use
the same stopping rule (2.18)-(2.20) as ADAPT but with prespecified group sizes. The expected
number of stages, power and expected sample size of these tests are given in Table 3
at various $(\mu, \sigma)$ values. All tests use $m = 34$ (with the exception of OGS*(4)) and $M = 120$
(with the exception of DJ, which has unbounded maximum sample size), the first-stage and
maximum sample sizes of OGS(3) in Section 3.1, and nominal power levels $\alpha = .025$ and
$1 - \tilde\alpha = .8$ at $(\mu, \sigma) = (0, 1)$ and $(\theta', 1)$, respectively, where $\theta'$ is as in Section 3.1. Other
values of the user-specified parameters of the tests are listed below.



• ADAPT: The adaptive test described in Section 2.3 with $u_1$ fixed at $\theta'$, $b = 2.49$,
$\tilde b = .59$ and $c = 2.7$ corresponding to $\epsilon = 1/2$, $\tilde\epsilon = 3/4$.

• OGS*(k): Jennison and Turnbull's (2000, Section 11.5) group sequential t-test with
k groups and the same m and error spending function as that of OGS(k) in Table 2:
OGS*(3) uses $\rho = .99$ and group sizes 34, 43, 43; OGS*(4) uses $\rho = 1.13$ and group
sizes 29, 30, 30, 31.

• DJ: The adaptive 3-stage t-test of Denne and Jennison (2000) with $\rho = .99$, to match
OGS*(3).

• LS: Lai and Shih's (2004, Section 3.4) group sequential test with group sizes 34, 43,
43, so that m = 34 and M = 120.
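
The generalized likelihood ratio statistic for a normal mean with unknown variance used in this section can be evaluated directly from the data. The following is a minimal sketch, not the authors' implementation; the function names are illustrative assumptions. It computes $(n/2)\log(1 + \bar X_n^2/\hat\sigma_n^2)$ with the MLE of the variance (divisor n) and also expresses the same quantity through the usual one-sample t-statistic, since $\bar X_n^2/\hat\sigma_n^2 = t^2/(n-1)$.

```python
import math


def glr_normal_mean(xs):
    """GLR statistic (n/2) * log(1 + xbar^2 / sigma_hat^2) for testing mu = 0."""
    n = len(xs)
    xbar = sum(xs) / n
    # The MLE of the variance divides by n, not by n - 1.
    sigma2_hat = sum((x - xbar) ** 2 for x in xs) / n
    return 0.5 * n * math.log1p(xbar ** 2 / sigma2_hat)


def glr_from_t(t, n):
    """The same statistic written in terms of the one-sample t-statistic."""
    return 0.5 * n * math.log1p(t ** 2 / (n - 1))


if __name__ == "__main__":
    sample = [0.3, -1.2, 0.8, 1.5, 0.1, 0.9]  # made-up data for illustration
    print(glr_normal_mean(sample))
```

The second form makes explicit that applying thresholds to this statistic is equivalent to applying (transformed) thresholds to the sequential t-statistics.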



INSERT TABLE 3 ABOUT HERE 



When $\sigma = 1$, ADAPT, OGS*, and DJ have similar power and expected sample size
properties, with ADAPT having the smallest expected number of stages and smaller expected
sample size than OGS*(3). LS has the highest power of the five tests, but also the highest
expected sample size. When $\sigma < 1$ and $\mu = 0$, ADAPT has substantially smaller expected
sample size than OGS*(3) and even OGS*(4). DJ has similar operating characteristics to ADAPT
and LS when $\sigma < 1$. However, when $\sigma > 1$, the expected sample size of DJ becomes much
larger than those of other tests because its total sample size is chosen to be proportional
to the estimate of $\sigma^2$ at the end of the previous stage. In all cases evaluated, ADAPT has
the smallest expected number of stages, less than 2 in each case, showing that it most often
behaves like a FSS or 2-stage test, as in the known-variance setting of Table 1.
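
To see why a total sample size proportional to the variance estimate grows quickly when $\sigma > 1$, consider the following sketch. It is not Denne and Jennison's actual updating rule: it simply uses the familiar fixed-sample formula $n = \{(z_{\alpha} + z_{\tilde\alpha})\hat\sigma/\delta\}^2$, and the function name and the effect size delta = 0.3 are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def stein_type_total_size(sigma_hat, delta, alpha=0.025, power=0.8):
    """Fixed-sample size for a one-sided z-test, with sigma replaced by its estimate."""
    z = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    return math.ceil((z * sigma_hat / delta) ** 2)


if __name__ == "__main__":
    # Hypothetical effect size; doubling sigma_hat roughly quadruples the sample size.
    for sigma_hat in (0.5, 1.0, 2.0):
        print(sigma_hat, stein_type_total_size(sigma_hat, delta=0.3))
```

Because the size scales with $\hat\sigma^2$ and no maximum M is imposed, an overestimate of $\sigma$ at an interim stage translates directly into a much larger trial.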



3.3 Coronary Intervention Study 



The National Heart, Lung and Blood Institute (NHLBI) Type II Coronary Intervention
Study (1982) was designed to investigate the cholesterol-lowering effects of cholestyramine
on patients with Type II hyperlipoproteinemia and coronary artery disease. Patients were
randomized into cholestyramine and placebo groups, and coronary angiography was performed
before and after five years of treatment. It was found that the disease had progressed
in 20 of 57 in the placebo group and 15 of 59 in the cholestyramine group. Proschan and
Hunsberger (1995) and Li et al. (2002) have considered how this study could have been
extended by using their two-stage tests for the difference in two normal means with common
unknown variance. To apply these tests to the NHLBI study, they assumed the first-stage
sample size to be 58 = (57+59)/2 for the normal problem and used the arcsine transformation
so that the difference between the transformed binomial frequencies, $\hat p_1$ for the placebo
group and $\hat p_2$ for the treatment group, is approximately normally distributed; details are
given below. As an alternative we apply the three-stage test in Section 2.3
to two binomial populations. In the notation of Section 2.3, to test $H_0: p_2 \le p_1$ we have
$\theta = (p_1, p_2)^T$, $u(\theta) = p_2 - p_1$, $u_0 = 0$, and the test statistic
$\inf_{\theta: u(\theta) = \delta} n I(\hat\theta_n, \theta)$ in (2.18)-(2.20) takes the form



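\[
n\left\{ \hat p_{1,n} \log\frac{\hat p_{1,n}}{\hat p_{\delta,n}}
 + \hat q_{1,n} \log\frac{\hat q_{1,n}}{1 - \hat p_{\delta,n}}
 + \hat p_{2,n} \log\frac{\hat p_{2,n}}{\hat p_{\delta,n} + \delta}
 + \hat q_{2,n} \log\frac{\hat q_{2,n}}{1 - \hat p_{\delta,n} - \delta} \right\},
\]

where $\hat p_{i,n}$ is the maximum likelihood estimator of $p_i$ based on n observations,
$\hat q_{i,n} = 1 - \hat p_{i,n}$, and $\hat p_{\delta,n}$ is the maximum likelihood estimator of $p_1$
under the assumption $p_2 - p_1 = \delta$. The treatment and placebo groups are assumed to have
the same per-group sample size during interim analyses, following Proschan and Hunsberger
(1995) and Li et al. (2002).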
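
The two-binomial statistic requires the constrained maximum likelihood estimator of $p_1$ under $p_2 - p_1 = \delta$, which has no simple closed form for general $\delta$. The sketch below is one way to compute it numerically, assuming equal per-group sample size n; the function names and the use of ternary search (valid because the constrained log-likelihood is concave in $p_1$) are our illustrative choices, not part of the original test description.

```python
import math


def kl_term(p_hat, p):
    """p_hat * log(p_hat / p), with the convention 0 * log(0) = 0."""
    return 0.0 if p_hat == 0.0 else p_hat * math.log(p_hat / p)


def constrained_mle(p1_hat, p2_hat, delta, iters=200):
    """MLE of p1 subject to p2 = p1 + delta, for equal per-group sample sizes."""
    eps = 1e-12
    lo = max(0.0, -delta) + eps
    hi = min(1.0, 1.0 - delta) - eps

    def loglik(p):
        return (p1_hat * math.log(p) + (1 - p1_hat) * math.log(1 - p)
                + p2_hat * math.log(p + delta)
                + (1 - p2_hat) * math.log(1 - p - delta))

    # The log-likelihood is concave in p, so ternary search locates its maximum.
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if loglik(a) < loglik(b):
            lo = a
        else:
            hi = b
    return 0.5 * (lo + hi)


def glr_two_binomials(n, p1_hat, p2_hat, delta):
    """n times the Kullback-Leibler distance from the MLE to the set p2 - p1 = delta."""
    p = constrained_mle(p1_hat, p2_hat, delta)
    return n * (kl_term(p1_hat, p) + kl_term(1 - p1_hat, 1 - p)
                + kl_term(p2_hat, p + delta)
                + kl_term(1 - p2_hat, 1 - p - delta))


if __name__ == "__main__":
    # Observed NHLBI proportions, with the per-group size taken as 58.
    print(glr_two_binomials(58, 15 / 59, 20 / 57, delta=0.0))
```

For $\delta = 0$ the constrained estimator reduces to the pooled proportion, and for $\delta$ equal to the observed difference the statistic is zero, which gives two quick sanity checks.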

Letting $S_n$ denote the sum of n independent normal random variables with mean $\mu$ and
variance 1, following a pilot study of size m resulting in $S_m = s_m$, Proschan and
Hunsberger's (1995) test chooses $n_2$ and critical value c to satisfy the conditional power
criterion

\[
\mathrm{pr}\{ S_{n_2}/n_2^{1/2} > c \mid S_m = s_m,\ \mu = s_m/m \} \ge 1 - \tilde\alpha \tag{3.3}
\]

and type I error constraint

\[
E_0\left[ \mathrm{pr}_0\{ S_{n_2}/n_2^{1/2} > c \mid S_m \} \right] = \alpha. \tag{3.4}
\]



In order to solve for $n_2$ and c, a parametric form for the probability in (3.4) is assumed,
which contains a user-specified futility boundary h and critical value k for the internal pilot.
Li et al. (2002) introduce a modification of Proschan and Hunsberger's (1995) test in which
the critical value c is specified before the internal pilot study but h, k, and $n_2$ are chosen to
satisfy (3.3) and (3.4) after the internal pilot study. This modification allows approximations
to the probabilities in (3.3) and (3.4) to be used in lieu of a specific parametric form. For the
coronary intervention study, Proschan and Hunsberger (1995) and Li et al. (2002) propose
using these tests with the variance-stabilizing transformation
$S_n = (2n)^{1/2}\{\arcsin(\hat p_{1,n}^{1/2}) - \arcsin(\hat p_{2,n}^{1/2})\}$.
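
A conditional power criterion of the type (3.3) can be evaluated in closed form for normal data. The sketch below is an illustration under stated assumptions rather than the Proschan-Hunsberger or Li et al. procedure: it treats $n_2$ as the total sample size, holds the critical value c fixed, plugs in $\mu = s_m/m$, and uses hypothetical function names. It also shows how the required total sample size grows without bound as the first-stage estimate approaches 0.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def conditional_power(s_m, m, n_total, c, mu=None):
    """pr{S_n / sqrt(n) > c | S_m = s_m} when future observations are N(mu, 1)."""
    if mu is None:
        mu = s_m / m  # plug-in estimate from the pilot stage
    k = n_total - m  # number of observations still to come
    # S_n - s_m is N(k * mu, k), so standardize the remaining increment.
    z = (c * math.sqrt(n_total) - s_m - k * mu) / math.sqrt(k)
    return 1.0 - STD_NORMAL.cdf(z)


def smallest_total_size(s_m, m, c, target=0.8, n_max=100_000):
    """Smallest n_total > m whose plug-in conditional power reaches target, else None."""
    for n_total in range(m + 1, n_max + 1):
        if conditional_power(s_m, m, n_total, c) >= target:
            return n_total
    return None


if __name__ == "__main__":
    m, c = 58, 1.7  # first-stage size and critical value borrowed from test L
    for z1 in (1.5, 1.0, 0.5, 0.25):
        s_m = z1 * math.sqrt(m)
        print(z1, smallest_total_size(s_m, m, c))
```

Without a cap on the total sample size (or a futility boundary), small positive first-stage estimates lead to very large second stages, which is the behavior that truncation at M is meant to avoid.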

INSERT TABLE 4 ABOUT HERE 



Table 4 gives the power, per-group expected sample size, and efficiency ratio (3.1), using
the normal approximation, relative to ADAPT (for alternatives $p_2 > p_1$) of the following
tests for various values of $p_1, p_2$ near 15/59 = .254 and 20/57 = .351, the values observed in
the NHLBI study (1982).



• L: Li et al.'s (2002) test with h = 1.036, k = 1.82, c = 1.7, $\alpha = .05$, conditional power
level .8, and first-stage size m = 58.

• PH: Proschan and Hunsberger's (1995) test with h = 1.036, k = 1.82, $\alpha = .05$,
conditional power level .8, and first-stage size m = 58.

• ADAPT: The adaptive test described in a previous paragraph with m = 58, M = 302
(the maximum sample size of L), and thresholds $b = 2.36$, $\tilde b = 1.1$, and $c = 1.55$
corresponding to $\alpha = .05$, $\tilde\alpha = .2$, and $\epsilon = \tilde\epsilon = 1/2$.

All three tests use the same first-stage size m = 58. ADAPT matches the maximum sample
size M = 302 of L, and the parameters of PH determine its maximum sample size to be
slightly larger at 354. The actual power of L and PH is around 50% for the values of $p_1$ and
$p_2$ in Table 4 with $p_2 - p_1 = .1$, and is less than 50% when $p_1 = .254$ and $p_2 = .351$, where
they were designed to have conditional power 80%. This is caused in part by premature
stopping for futility at the end of the first stage. Indeed, L and PH use the same futility
boundary and their probability of stopping at the end of the first stage when $p_1 = .254$ and
$p_2 = .351$ is .47, well exceeding the nominal type II error probability .2. One might ask if a
conditional power test can avoid this phenomenon by using a larger first-stage sample size
so that the estimate $\hat p_2 - \hat p_1$ is near 0 less often after the first stage when the true
difference $p_2 - p_1$ is substantially greater than 0. If the first-stage sample size of L is
raised to 162 (raising the maximum sample size to 1331), the resultant test has power 79% when
$p_1 = .254$ and $p_2 = .351$, approximately equal to the power of ADAPT. However, the expected
sample size of this version of L is 264 at this alternative, compared to the expected sample
size 213.1 of ADAPT. Similar oversampling also occurs for the values of $p_1$ and $p_2$ in Table 4
with $p_2 - p_1 > .1$, where the power of L and PH is closer to the nominal conditional power
level of 80%, but the efficiency ratio drops to around 75%.
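
The first-stage futility probability quoted above can be checked roughly with the arcsine normal approximation. In the sketch below, which is our own illustration with a hypothetical function name, the first-stage z-statistic is taken to be approximately normal with unit variance and mean $(2m)^{1/2}\{\arcsin(p_2^{1/2}) - \arcsin(p_1^{1/2})\}$ (oriented so that $p_2 > p_1$ gives a positive drift), and the trial stops for futility when the statistic falls below h.

```python
import math
from statistics import NormalDist


def first_stage_futility_prob(p1, p2, m, h):
    """Normal-approximation probability that the first-stage z-statistic is below h."""
    drift = math.sqrt(2 * m) * (math.asin(math.sqrt(p2)) - math.asin(math.sqrt(p1)))
    return NormalDist(mu=drift, sigma=1.0).cdf(h)


# Design values of tests L and PH at the observed NHLBI proportions.
prob = first_stage_futility_prob(0.254, 0.351, m=58, h=1.036)

if __name__ == "__main__":
    print(prob)
```

Under these assumptions the result lands in the mid .40s, close to the .47 reported above, and it bounds the attainable power: a test that stops for futility nearly half the time at this alternative cannot have power much above 50%, whatever the conditional power level used for the second stage.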



4 DISCUSSION 

Most previous works in the literature on adaptive design of clinical trials and mid-course 
sample size adjustments have focused on two-stage designs whose second-stage sample size 
is determined by the results from the first stage using conditional power. Although this 
approach is intuitively appealing, it does not adjust for the uncertainty in the first-stage 
parameter estimates that are used to determine the second-stage sample size. This can result
in substantial power loss, as shown in Section 3.1. Although Jennison and Turnbull (2003)
and Tsiatis and Mehta (2003) have pointed out the inefficiency of this approach and advocate
instead using group sequential designs, their critique focuses on the use of non-sufficient 
"weighted" test statistics and variability in the interim estimate. Through our extensive 
simulation studies we have shown that another problem with conditional power methods in 
practice is potential lack of power, which results from the difficulty in bridging conditional 
power with actual power and in choosing a futility stopping rule. 



In their recent survey of adaptive designs, Burman and Sonesson (2006) pointed out
that previous criticisms of the statistical principles and properties of these designs may be
unconvincing in some situations when flexibility and not having to specify parameters that
are unknown at the beginning of a trial (like the relevant treatment effect or variance) are
more imperative than efficiency or power, whereas most efficient group sequential
designs require the prespecification of the relevant alternative and variance, as in the case of
the optimal adaptive tests of Jennison and Turnbull (2006a, 2006b). Moreover, conditional
power tests are easy to implement while optimal adaptive tests require substantial dynamic
programming computations. The adaptive tests of Section 2 combine the attractive features
of both the conditional power and group sequential tests. Rather than achieving exact
optimality at a specified collection of alternatives through dynamic programming, they achieve
asymptotic optimality over the entire range of alternatives, resulting in near-optimality in
practice; see Section 3.1. These tests are based on efficient generalized likelihood ratio
statistics which have an intuitively "adaptive" appeal via estimation of unknown parameters by
maximum likelihood, ease of implementation, and freedom from having to specify the relevant
alternative (through the implied alternative) that conditional power tests enjoy. As
shown in Section 2.3, these generalized likelihood ratio statistics and the associated adaptive
tests can be readily extended to multiparameter settings with nuisance parameters and they
enjoy near-optimality in these more complicated and realistic settings as well; see Sections 3.2
and 3.3.

The possibility of adding a third stage to improve two-stage designs dates back to
Lorden (1983), who used upper bounds for the type I error probability that are overly
conservative for applications to clinical trials, which need to maintain the type I error probability
of the test at a prescribed level because of regulatory and publication requirements; see the
references in Section 1. We have modified Lorden's three-stage test by combining its basic
features to preserve its asymptotic optimality with those of Lai and Shih (2004) for efficient
group sequential designs. The adaptive test in Section 2 makes use of the maximum sample
size M to come up with an implied alternative which is used to choose the rejection and
futility boundaries appropriately so that the test does not lose much power in comparison
with the (most powerful) FSS test of the null hypothesis versus the implied alternative. This
idea has led to the superior power properties of ADAPT in Table 1, comparable to those of
the FSS test. Moreover, the expected number of stages of ADAPT in Table 1 ranges from
1.5 to 2.07 and is less than 2 for all cases in Table 3. Therefore ADAPT is not much less
convenient to run than the FSS test (with only 1 stage), in contrast with group sequential
tests with 3, 4, 5 or more interim analyses of the accumulated data. In practical terms, this
can provide substantial savings in the operational costs of the trial by eliminating the need
for data monitoring at interim analyses since the updated sample size and stopping rule
are completely determined at the end of the pilot stage.

On the other hand, there are situations where adding an additional stage or increasing
the maximum sample size may be desired, as pointed out by Cui, Hung, and Wang (1999)
and Lehmacher and Wassmer (1999). For example, Cui et al. (1999) cite a study protocol, which
was reviewed by the Food and Drug Administration, involving a Phase III group sequential
trial for evaluating the efficacy of a new drug to prevent myocardial infarction in patients
undergoing coronary artery bypass graft surgery. During interim analysis, the observed
incidence for the drug achieved a reduction that was only half of the target reduction assumed
in the calculation of the maximum sample size M, resulting in a proposal to increase the
maximum sample size to $N_{\max}$. The basic idea underlying the proposed test in Section 2 can
be easily modified to allow increase of the maximum sample size from M to no more than
$N_{\max}$ after the second stage, resulting in a test with at most four stages. The type I error
probability of the modified test can be computed numerically by recursive integration or by
Monte Carlo simulations, as described in Section 2.
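
As a rough illustration of the Monte Carlo approach to the type I error probability, the sketch below simulates a generic two-stage z-test under $\mu = 0$ with unit-variance normal observations. It is a stand-in, not the stopping rule (2.18)-(2.20): the function name, the rejection thresholds, and the `total_size` callback (which lets the total sample size depend on the first-stage z-statistic, as in a mid-course adjustment) are all hypothetical.

```python
import math
import random


def mc_type1_error(m, total_size, b_first, c_final, reps=200_000, seed=0):
    """Monte Carlo estimate of pr_0(reject) for a two-stage one-sided z-test."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        s = rng.gauss(0.0, math.sqrt(m))  # first-stage sum under mu = 0
        z1 = s / math.sqrt(m)
        if z1 >= b_first:  # early rejection at the first stage
            rejections += 1
            continue
        n = total_size(z1)  # total sample size chosen from first-stage data
        s += rng.gauss(0.0, math.sqrt(n - m))  # independent second-stage increment
        if s / math.sqrt(n) >= c_final:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    # Fixed total size 100 with Pocock-type constants; roughly .025 is expected.
    print(mc_type1_error(50, lambda z1: 100, b_first=2.178, c_final=2.178))
```

Replacing the constant callback by a data-dependent rule while keeping the same final threshold shows how an unadjusted sample size modification changes the rejection probability, which is why the thresholds of an adaptive test have to be calibrated against the actual sampling distribution.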